Primary Models
Three models are fitted to the data, each chosen so that its strengths cover the weaknesses of the others. K-nearest neighbours with kernel smoothing is used to capture local similarity, not only geospatially but throughout the entire feature space. Gradient boosted decision trees are used for their robustness and successful track record. Finally, a generalised additive model is fitted: splines are first fitted to the features, and a linear combination of these nonlinear functions is then taken.
5-fold cross validation is used. 10-fold was tried first, but was dropped to 5 folds to reduce the variance of the validation score. This likely increases the bias of the validation score (overestimating the error), but accuracy estimation is not the goal of this part of the analysis.
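The fold construction used throughout can be sketched in base R (a minimal illustration; the analysis' actual `folds` object and `KFOLD` constant are created elsewhere, and `make_folds` here is a hypothetical helper):

```r
#build k disjoint validation folds by shuffling row indices
#(a minimal base-R sketch, not the folds object used in this analysis)
make_folds <- function(n, k, seed = 0) {
  set.seed(seed)
  idx <- sample(seq_len(n))                    #random permutation of row indices
  split(idx, rep(seq_len(k), length.out = n))  #deal indices into k groups
}

folds_demo <- make_folds(n = 103, k = 5)
length(folds_demo)        #5 folds
sum(lengths(folds_demo))  #every row appears in exactly one fold: 103
```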
Category Encodings
Before fitting, it is useful for the tree and neighbourhood models to encode the factor variable type as a numeric. The levels are encoded in order of their mean price. When target encoding, it is often wise to regularise the ordering with k-fold or expanding-mean encodings; in this case, with only three levels and an obvious separation in price, the simple ordering is safe enough.
The method feature tells a similar story, so the details are omitted.
ggplot(property_data,aes(x=type,y=log(price))) + geom_boxplot(fill='#0078D7')
encode_type <- function(df) {
  #encode type in order of mean price: Unit < Townhouse < House
  df$type_encoded <- revalue(df$type, replace=c('Unit'='1','Townhouse'='2','House'='3')) %>%
    as.character %>% as.numeric
  return(df)
}
encode_method <- function(df) {
  #encode method in order of mean price
  df$method_encoded <- revalue(df$method, replace=c('SP'='1','SA'='2','S'='3')) %>%
    as.character %>% as.numeric
  return(df)
}
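For contrast with the simple ordering above, the expanding-mean regularisation mentioned earlier can be sketched in base R: each row is encoded using only the mean of the target over earlier rows of the same category, which avoids target leakage. This is illustrative only and not used in the analysis; `expanding_mean_encode` is a hypothetical helper:

```r
#expanding-mean target encoding: encode each row's category by the mean target
#over *previous* rows of that category; unseen levels fall back to the prior
#(illustrative sketch only; the analysis uses a plain mean-price ordering)
expanding_mean_encode <- function(category, target, prior = mean(target)) {
  n <- length(category)
  enc <- numeric(n)
  sums <- counts <- setNames(numeric(0), character(0))
  for (i in seq_len(n)) {
    key <- as.character(category[i])
    if (is.na(counts[key])) { sums[key] <- 0; counts[key] <- 0 }
    #fall back to the global prior when no earlier rows exist for this level
    enc[i] <- if (counts[key] == 0) prior else sums[key] / counts[key]
    sums[key] <- sums[key] + target[i]
    counts[key] <- counts[key] + 1
  }
  enc
}

expanding_mean_encode(c("House","House","Unit","House"), c(10, 14, 6, 12))
# → 10.5 10.0 10.5 12.0  (prior is 10.5; the last "House" gets mean(10, 14))
```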
Gradient Boosted Trees
The model fitting begins with gradient boosted decision trees (GBDT), implemented with the xgboost package, an R API for the popular software library of the same name. GBDT is perhaps the most widely applicable and robust machine learning technique: it captures interactions and non-linear trends without explicit encoding. For more information, you can find the xgboost documentation here.
Polar Coordinates
Looking at the map above, it appears that property value may be better parameterised with polar coordinates. Let’s set this up now with a function:
#----compute distance from city and bearings------------------------------------
polarise <- function(lnglat_centre, df) {
  locs <- select(df, c(lng, lat))
  #great-circle distance from the centre, in metres
  df$dist_cbd <- apply(locs, 1, function(loc) { distm(loc, lnglat_centre) })
  #initial bearing from the centre, in degrees
  df$bearing <- apply(locs, 1, function(loc) { bearing(lnglat_centre, loc) })
  return(df)
}
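The `polarise` function leans on geosphere's `distm` and `bearing`; the underlying spherical geometry can be sketched in base R (assuming a spherical Earth of radius 6371 km; `haversine_m` and `bearing_deg` are hypothetical stand-ins, not geosphere's implementations):

```r
#great-circle (haversine) distance in metres between two c(lng, lat) points,
#assuming a spherical Earth of radius 6371 km
haversine_m <- function(p1, p2, R = 6371000) {
  rad <- pi / 180
  dlat <- (p2[2] - p1[2]) * rad
  dlng <- (p2[1] - p1[1]) * rad
  a <- sin(dlat/2)^2 + cos(p1[2]*rad) * cos(p2[2]*rad) * sin(dlng/2)^2
  2 * R * asin(sqrt(a))
}

#initial bearing from p1 to p2, in degrees clockwise from north, in (-180, 180]
bearing_deg <- function(p1, p2) {
  rad <- pi / 180
  dlng <- (p2[1] - p1[1]) * rad
  y <- sin(dlng) * cos(p2[2]*rad)
  x <- cos(p1[2]*rad)*sin(p2[2]*rad) - sin(p1[2]*rad)*cos(p2[2]*rad)*cos(dlng)
  atan2(y, x) / rad
}

cbd <- c(144.9631, -37.8136)         #Melbourne CBD, as used below
haversine_m(cbd, c(145.0, -37.85))   #a point a few km south-east of the CBD
bearing_deg(cbd, c(145.0, -37.85))
```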
A variety of features were experimented with; ultimately only a small subset was chosen for the model.
Tuning the number of trees was performed via early stopping: once the validation score stops improving and starts to increase, training is stopped. Other hyperparameters, such as the maximum tree depth max_depth and the observation and feature subsampling ratios subsample and colsample_bytree, were tuned manually. After these were optimised, the learning rate was decreased to obtain a better fit.
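The early-stopping rule can be sketched over a precomputed score curve: keep the best round seen so far, and stop once no improvement has been observed for a fixed number of rounds (a toy illustration of the rule, not xgboost's internals):

```r
#given a vector of per-round validation scores (lower is better), return the
#best round under early stopping with the given patience
early_stop_round <- function(val_scores, patience) {
  best <- Inf; best_round <- 0
  for (r in seq_along(val_scores)) {
    if (val_scores[r] < best) { best <- val_scores[r]; best_round <- r }
    if (r - best_round >= patience) break  #no improvement for `patience` rounds
  }
  best_round
}

#a score curve that improves, then plateaus and degrades
scores <- c(0.9, 0.7, 0.5, 0.45, 0.44, 0.46, 0.47, 0.48, 0.49, 0.50)
early_stop_round(scores, patience = 3)  # → 5
```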
The graphic below shows the score as a function of training iterations, with the validation score in orange and the training score in blue. The dashed lines show one standard deviation either side of the mean training/validation score over the folds.
#---------------------prepare data for xgboost api---------------------------
xgb_train0 <- encode_type(train0) %>% encode_method
#add polar coordinates
MELBOURNE_CENTRE <- c(144.9631,-37.8136)
xgb_train0 <- polarise(MELBOURNE_CENTRE,xgb_train0)
xgb_features <- c(
'building_area'
,'dist_cbd'
,'bearing'
,'year_built'
,'type_encoded'
,'nrooms'
,'land_area'
,'method_encoded'
,'nbathroom'
,'ncar'
,'ntrain_3000'
)
xgb_train0_y <- log(xgb_train0$price)
xgb_train0_x <- xgb_train0 %>% select(xgb_features)
#prepare the data in an xgboost specific data structure
Dtrain0 <- xgb.DMatrix(data=as.matrix(xgb_train0_x),label=xgb_train0_y)
#------------------------ setup parameters--------------------------------------
PARAMS <- list(
seed=0,
objective='reg:linear',
eta=0.005,
eval_metric='rmse',
max_depth=7,
colsample_bytree=0.78,
subsample=0.80
)
#------------------------cross validation---------------------------------------
set.seed(45785)
cv_obj <- xgb.cv(
params=PARAMS,
data=Dtrain0,
nthread = 4,
folds=folds,
verbose=FALSE,
nrounds=1e+6,
early_stopping_rounds=2000
)
#save the number of trees that gives the best out of fold generalisation
OPTIMAL_NROUNDS <- cv_obj$best_ntreelimit
#-----------------------------plot----------------------------------------------
#plot training information
eval_log <- cv_obj$evaluation_log
ggplot(eval_log,aes(x=iter)) +
coord_cartesian(ylim=c(0,0.4)) + xlab('iteration') + ylab('score') +
ggtitle('Validating Xgboost model') +
geom_line(aes(y=train_rmse_mean),lwd=1,col='#0078D7') +
geom_line(aes(y=train_rmse_mean + train_rmse_std),lwd=1,col='#0078D7',lty=2) +
geom_line(aes(y=train_rmse_mean - train_rmse_std),lwd=1,col='#0078D7',lty=2) +
geom_line(aes(y=test_rmse_mean),lwd=1,col='orange') +
geom_line(aes(y=test_rmse_mean + test_rmse_std),lwd=1,col='orange',lty=2) +
geom_line(aes(y=test_rmse_mean - test_rmse_std),lwd=1,col='orange',lty=2)
The figure below shows the gain of each feature in the resultant model, trained on all of train0.
#----------------generate out-of-fold (oof) predictions-------------------------
#initialise the out-of-fold predictions vector
xgb_oof_preds <- rep(NA,nrow(train0))
for (i in 1:KFOLD) {
fold <- folds[[i]]
#prepare fold specific training data
train0_x_fold <- xgb_train0_x[-fold,]
train0_y_fold <- xgb_train0_y[-fold]
Dtrain0_fold <- xgb.DMatrix(data=as.matrix(train0_x_fold),label=train0_y_fold)
#prepare fold specific validation data
val0_x_fold <- xgb_train0_x[fold,]
val0_y_fold <- xgb_train0_y[fold]
Dval0_fold <- xgb.DMatrix(data=as.matrix(val0_x_fold),label=val0_y_fold)
#fit the model
model <- xgb.train(params=PARAMS,
data=Dtrain0_fold,
nrounds=OPTIMAL_NROUNDS,
verbose=0)
#save out of fold predictions
preds <- predict(model,newdata=Dval0_fold)
xgb_oof_preds[fold] <- preds
}
#compute out of fold error
xgb_oof_error <- sqrt(mean((xgb_oof_preds-xgb_train0_y )^2))
#---------------------- train model on all train0-------------------------------
#Retrain the model on all train0, this will be used to predict on the ensemble
#validation set, the predictions will be used to
#validate the ensembling meta model
xgb_model <- xgboost(data = as.matrix(xgb_train0_x),
label = xgb_train0_y ,
params = PARAMS,
verbose=FALSE,
nrounds = OPTIMAL_NROUNDS
)
#----------------------------feature importance---------------------------------
#Plot feature importances for this model
xgb_fi <- xgb.importance(model=xgb_model)
xgb.plot.importance(xgb_fi,measure='Gain',main='Feature Importance (gain)')
Now the xgboost model is used to predict on ensemble_validation for later ensembling.
#----------------create predictions for the ensemble fold-----------------------
#These predictions will be used to validate the ensembling meta model
xgb_ensemble_validation <- encode_type(ensemble_validation) %>% encode_method %>%
polarise(MELBOURNE_CENTRE,.)
xgb_ensemble_validation_y <- log(xgb_ensemble_validation$price)
xgb_ensemble_validation_x <- xgb_ensemble_validation %>%
select(xgb_features) %>% as.matrix
xgb_ens_preds <- predict(xgb_model,newdata=xgb_ensemble_validation_x)
Below are individual conditional expectation (ICE) plots for the most relevant features. These are the same as partial dependence plots, only disaggregated: each thin line shows the estimated impact of the X variable on predicted log(price) for one individual observation. Visualising the data this way gives a better sense of the reliability of the plot and makes interactions easier to spot: if the thin lines behave differently at different values of the X variable, then X might be interacting with another variable. You can learn more about ICE plots here. The authors of that paper also wrote the R package ICEbox, used here. The highlighted line is the mean of the individual conditional expectation curves, which is identical to a partial dependence plot.
Pay careful attention to the tick marks along the bottom, which show the distribution of data points along the X-axis.
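The mechanics behind an ICE plot can be sketched with any prediction function: copy each observation down a grid of values for the feature of interest, predict, and collect one curve per observation (a toy linear model with an interaction stands in for xgboost here; `ice_curves` is a hypothetical helper, not ICEbox's implementation):

```r
#compute ICE curves: one row per observation, one column per grid value of the
#swept feature; averaging the rows gives the partial dependence curve
ice_curves <- function(predict_fn, X, feature, grid) {
  t(sapply(seq_len(nrow(X)), function(i) {
    Xi <- X[rep(i, length(grid)), , drop = FALSE]  #copy row i down the grid
    Xi[[feature]] <- grid                          #sweep the chosen feature
    predict_fn(Xi)
  }))
}

#toy model with an interaction, so individual curves differ in slope
toy_predict <- function(df) df$a * (1 + df$b)
X <- data.frame(a = c(1, 2, 3), b = c(0, 1, 2))
grid <- seq(0, 1, by = 0.5)
curves <- ice_curves(toy_predict, X, "a", grid)
colMeans(curves)  #the mean ICE curve is the partial dependence curve: 0 1 2
```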
plot_xgb_ice <- function(model, X, y, feature) {
  #mostly transparent lines, so overplotting reveals density
  colour <- rep(rgb(0,0,0,0.05), nrow(X))
  ice_obj <- ice(model,
                 X = as.matrix(X),
                 y = y,
                 predictor = which(feature == names(X)),
                 verbose = F,
                 frac_to_build = 0.1)
  plot(ice_obj, plot_orig_pts = F, colorvec = colour, xlab = feature)
}
plot_xgb_ice(xgb_model , xgb_train0_x,xgb_train0_y,'building_area')
plot_xgb_ice(xgb_model , xgb_train0_x,xgb_train0_y,'dist_cbd')
plot_xgb_ice(xgb_model , xgb_train0_x,xgb_train0_y,'bearing')
plot_xgb_ice(xgb_model , xgb_train0_x,xgb_train0_y,'year_built')
K-Nearest Neighbours
Ensembling tends to perform better when combining structurally different models. We've already fitted a tree-based model; now a nearest neighbours model is fitted.
This model also makes use of kernel smoothing: the relevance of each neighbouring data point is scaled by the kernel, a decreasing function of distance. The kknn package is used, optimised with caret. The Epanechnikov kernel was chosen ahead of time. The kknn package rescales each dimension of the data appropriately, so the model is invariant under shifts and dilations of the feature space; there is no need to centre or scale these data.
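The weighting scheme can be sketched in one dimension: find the k nearest neighbours, rescale distances so the (k+1)-th neighbour sits at distance 1 (kknn uses a similar normalisation), and weight each neighbour's target by the Epanechnikov kernel. This is an illustration of the idea, not kknn's implementation:

```r
#Epanechnikov kernel: weight falls off quadratically, reaching zero at d = 1
epanechnikov <- function(d) pmax(0, 0.75 * (1 - d^2))

#kernel-weighted k-nearest-neighbour prediction in one dimension
kernel_knn_predict <- function(x_train, y_train, x_new, k) {
  d <- abs(x_train - x_new)
  nn <- order(d)[seq_len(k + 1)]     #k neighbours, plus one used for scaling
  d_scaled <- d[nn] / d[nn[k + 1]]   #the (k+1)-th neighbour lands at distance 1
  w <- epanechnikov(d_scaled)[seq_len(k)]
  sum(w * y_train[nn[seq_len(k)]]) / sum(w)
}

x <- c(1, 2, 3, 10)
y <- c(1, 2, 3, 50)
kernel_knn_predict(x, y, x_new = 2.1, k = 2)  #close to 2: the far outlier gets no weight
```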
One limitation of this cross validation is worth acknowledging: the features were chosen before cross validating, so the validation scores are likely somewhat optimistic. This is not a worry here, since accuracy estimation is not the goal of this part of the analysis.
#---------------------------prepare data----------------------------------------
knn_train0 <- encode_type(train0)
knn_features <- c('building_area'
,'lng'
,'lat'
,'year_built'
,'type_encoded'
,'nrooms'
,'land_area'
)
knn_train0_y <- log(knn_train0$price)
knn_train0_x <- knn_train0 %>% select(knn_features)
#---------------Prepare for caret's optimisation--------------------------------
#caret's trainControl function demands training indices
#simply passing -folds won't work
train_folds <- lapply(folds,function(f){which(!(1:nrow(knn_train0) %in% f))})
#using folds created earlier, create an object to pass to caret
#for cross validation
trainingParams <- trainControl(index=train_folds,
indexOut=folds,
verbose=FALSE)
#create a dataframe of hyperparameter values to search
tunegrid <- data.frame(kmax=1:30,
distance=rep(1,30),
kernel=rep('epanechnikov',30)
)
#-----------------------------Run Caret-----------------------------------------
#setup parallel computing cluster
cl <- makePSOCKcluster(4)
registerDoParallel(cl)
#Find the optimal kmax hyperparameter
knn_model <- train(x=knn_train0_x,
y=knn_train0_y,
method='kknn',
metric='RMSE',
tuneGrid = tunegrid,
trControl = trainingParams
)
The figure below shows the validation score as a function of the number of neighbouring points included. Based on this, a neighbourhood of 20 is chosen.
#-------------------------------------------plot--------------------------------
res <- knn_model$results
res <- select(res,c(kmax,distance,RMSE,MAE))
ggplot(res,aes(x=kmax)) + geom_line(aes(y=RMSE),lwd=2,color='#0078D7')
#save optimal hyperparameters
tuned <- data.frame(kmax=20,distance=1,kernel='epanechnikov')
#---------------Create out-of-fold (OOF) predictions----------------------------
knn_oof_preds <- rep(NA,nrow(knn_train0))
#iterate over folds
for (i in 1:KFOLD) {
fold <- folds[[i]]
#prepare fold specific data
train0_x_fold <- knn_train0_x[-fold,]
train0_y_fold <- knn_train0_y[-fold]
val0_x_fold <- knn_train0_x[fold,]
val0_y_fold <- knn_train0_y[fold]
#fit the model on this fold with the tuned hyperparameters
model <- train(x=train0_x_fold,
y=train0_y_fold,
method='kknn',
metric='RMSE',
tuneGrid=tuned,
trControl = trainControl('none')
)
#compute prediction on the remaining fold
preds <- predict(model$finalModel,newdata=val0_x_fold)
#add to the OOF predictions
knn_oof_preds[fold] <- preds
}
#stop the cluster
stopCluster(cl)
knn_oof_error <- sqrt(mean((knn_oof_preds-knn_train0_y)^2))
#--------------create predictions on the ensemble validation set----------------
knn_ensemble_validation <- encode_type(ensemble_validation)
knn_ensemble_validation_y <- log(knn_ensemble_validation$price)
knn_ensemble_validation_x <- knn_ensemble_validation %>% select(knn_features)
knn_ens_preds <- predict(knn_model,newdata=knn_ensemble_validation_x)
Generalised Additive Model
Now an additive model is fitted using the gam package. This model was fitted last, as it relies the most on an understanding of the data.
The model expression was constructed manually: first through exploration, then by tuning each spline's degrees of freedom. Simplicity was favoured over complexity, to avoid the overfitting that lengthy experimentation with the data can cause.
caret was not used this time; it has a habit of oversimplifying the fitting interface.
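The spline-then-linear-combination structure can be illustrated with the splines package that ships with R: a natural spline basis expansion inside lm plays the same role as a single s(...) term in the GAM (simulated data for illustration; gam itself fits several such terms by backfitting):

```r
library(splines)  #ships with base R

#illustrative data: a smooth nonlinear effect of x1 plus a linear effect of x2
set.seed(1)
n <- 200
x1 <- runif(n, 0, 10)
x2 <- runif(n)
y <- sin(x1) + 2 * x2 + rnorm(n, sd = 0.1)

#ns(x1, df = 8) expands x1 into a natural cubic spline basis; the linear model
#then takes a linear combination of those basis columns, as a GAM does per term
fit <- lm(y ~ ns(x1, df = 8) + x2)
#the spline term recovers the sine shape far better than a straight line would
summary(fit)$r.squared
```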
#---------------------------prepare data----------------------------------------
train0_gam <- polarise(MELBOURNE_CENTRE,train0)
gc <- gam.control(maxit=100)
form <- formula(
log(price) ~
type + #this is a factor
nrooms +
nbathroom +
factor(nbathroom):factor(nrooms) +
s(building_area_log,df=15) +
s(dist_cbd,df=30) +
s(bearing,df=28) +
s(year_built,df=8) +
s(land_area_log,df=8) +
nsupermarket_2000 +
ntrain_3000
#+ nschool_2000
#+ ncafe_1000
)
#---------------------------Generate out of fold predictions--------------------
#initialise
gam_oof_preds <- rep(NA,nrow(train0))
for (i in 1:KFOLD) {
fold <- folds[[i]]
train0_gam_fold <- train0_gam[-fold,]
val0_gam_fold <- train0_gam[fold,]
gam_model <- gam(formula = form ,
data = train0_gam_fold ,
#control = gc,
family='gaussian'
)
#add out of fold predictions for this fold
preds <- predict(gam_model,newdata=val0_gam_fold)
gam_oof_preds[fold] <- preds
}
#compute error
gam_oof_error <- sqrt(mean((gam_oof_preds-log(train0_gam$price) )^2))
#----------retrain the full model to generate features for the meta model------
gam_model <- gam(formula = form ,
data = train0_gam ,
#control = gc,
family='gaussian'
)
#--------------Generate metafeatures for the ensembling meta-model--------------
#These predictions will be used to validate the ensembling meta model
gam_ensemble_validation <- encode_type(ensemble_validation) %>%
polarise(MELBOURNE_CENTRE,.)
gam_ens_preds <- predict(gam_model,newdata=gam_ensemble_validation)